[57] ABSTRACT

A microprocessor of the superscalar pipelined type, having 
speculative execution capability, is disclosed. Speculative 
execution is under the control of a fetch unit having a branch 
target buffer and a return address stack, each having multiple 
entries. Each entry includes an address value corresponding
to the destination of a branching instruction, and an
associated register value, such as a stack pointer. Upon the
execution of a subroutine call, the return address and current
stack pointer value are stored in the return address stack, to
allow for fetching and speculative execution of the
sequential instructions following the call in the calling program.
Any branching instruction, such as the call, return, or 
conditional branch, will have an entry included in the branch 
target buffer; upon fetch of the branch on later passes, 
speculative execution from the target address can begin 
using the stack pointer value stored speculatively in the 
branch target buffer in association with the target address. 

25 Claims, 4 Drawing Sheets 



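[Front-page drawing: block diagram of fetch unit 26. Labeled
elements: fetch pointer 50 (FETCH PTR) with increment input INC;
multiplexers 52, 58, and 59; fetch address FA; physical address
PA; instruction µTLB 22; instruction buffer (INSTR) with break
check 62 and code segment limit check 64; instruction code CODE
to predecode 0 stage 28; inputs from register file 39 and from
the execution stage (EX); connection to/from level 1 instruction
cache 16i. Return address stack 55 holds entries RASTK0 through
RASTK3 with associated stack pointers RSP0 through RSP3. Branch
target buffer 56 holds tags T0 through T127, targets D0 through
D127, and stack pointers BSP0 through BSP127.]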



03/10/2004, EAST Version: 1.4.1 



U.S. Patent 



Dec. 15, 1998 



Sheet 1 of 4 



5,850,543 



[FIG. 1: block diagram of data processing system 2. Subsystems
on the bus: communication ports 3, graphics display system 4,
main memory subsystem 5 (containing the stack), input devices,
and disk system 8. Blocks of microprocessor 10: bus interface
unit 12, level 2 cache 14, level 1 data cache 16d, level 1
instruction cache 16i, microcache 18, clock generation and
control 20, instruction µTLB 22, JTAG and BIST 24, fetch unit
26, predecode 0 stage 28, floating-point unit 30, predecode 1
stage 32, decode 34, data µTLB 38, register file 39, load/store
units 40, ALUs 42, operand unit 44, multiplexer 45, microcode
ROM 46, and microsequencer 48.]



[Sheet 2 of 4. FIG. 2: block diagram of fetch unit 26. The
drawing labels did not survive extraction.]



[Sheet 3 of 4. FIG. 3: layout of entry 56n in BTB 56, with tag
Tn (logical address LA and OFFSET), target Dn, history HISn, and
stack pointer BSPn. FIGS. 4a and 4b: contents of the branch
target buffer (columns TAG (Tn), TARGET (Dn), HIS, BSP) and of
the return address stack (columns RASTK, RSP), labeled 55 and
56. Values legible in FIG. 4a: SPEL RA, SPEL SP. Values legible
in FIG. 4b: T110, 500, 011, SP500; SPEL RA, SPEL SP; 120,
SP110.]



[Sheet 4 of 4. FIGS. 4c, 4d, and 4e: later stages of the same
tables (columns TAG (Tn), TARGET (Dn), HIS, BSP, and RASTK,
RSP). Values legible in FIG. 4c: T110, 500, 011, SP500; SPEL RA,
SPEL SP. Values legible in FIG. 4d: T110, 500, 011, SP500; SPEL
RA, SPEL SP; T700, 120, 010, SP120. Values legible in FIG. 4e:
those of FIG. 4d, plus 120, SP110.]




MICROPROCESSOR WITH SPECULATIVE 
INSTRUCTION PIPELINING STORING A 
SPECULATIVE REGISTER VALUE WITHIN 
BRANCH TARGET BUFFER FOR USE IN 
SPECULATIVELY EXECUTING 
INSTRUCTIONS AFTER A RETURN 

This invention is in the field of microprocessors, and is 
more specifically directed to program control techniques for 
assisting speculative execution in microprocessors of the 
pipelined superscalar type. 

Background of the Invention 

Significant advances have recently been made in the 
design of microprocessors to improve their performance, as 
measured by the number of instructions executed over a 
given time period. One such advance relates to the recent 
introduction of microprocessors of the "superscalar" type, 
which can effect parallel instruction computation with a 
single instruction pointer. Typically, superscalar micropro- 
cessors have multiple execution units, such as multiple 
integer arithmetic logic units (ALUs) and a floating point 
unit (FPU), for executing program instructions, and thus 
have multiple pipelines. As such, multiple machine instruc- 
tions may be executed simultaneously in a superscalar 
microprocessor, providing obvious benefits in the overall 
performance of the device and its system application. 

Another common technique used in modern microproces- 
sors to improve performance involves the "pipelining" of 
instructions. As is well known in the art, microprocessor 
instructions each generally involve several sequential 
operations, such as instruction fetch, instruction decode, 
retrieval of operands from registers or memory, execution of 
the instruction, and writeback of the results of the instruc- 
tion. Pipelining of instructions in a microprocessor refers to 
the staging of a sequence of instructions so that multiple 
instructions in the sequence are simultaneously processed at 
different stages in the internal sequence. For example, if a 
pipelined microprocessor is executing instruction n in a 
given microprocessor clock cycle, a four-stage pipelined 
microprocessor may simultaneously (i.e., in the same 
machine cycle) retrieve the operands for instruction n+1 
(i.e., the next instruction in the sequence), decode instruction 
n+2, and fetch instruction n+3. Through the use of
pipelining, the microprocessor can effectively execute a
sequence of multiple-cycle instructions at a rate of one per
clock cycle.
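
The four-stage example above can be made concrete with a small sketch. The stage names and the function names below are illustrative assumptions, not terms taken from this specification; the model simply assumes that instruction i enters the first stage in cycle i and advances one stage per cycle with no stalls.

```python
# Minimal sketch of an ideal four-stage pipeline (no stalls, no branches).
# Stage names are illustrative assumptions.
STAGES = ["fetch", "decode", "operand", "execute"]


def pipeline_snapshot(cycle, num_instructions):
    """Return {stage: instruction index} for one clock cycle.

    Instruction i enters the fetch stage in cycle i and advances one
    stage per cycle, so in cycle c the stage at depth s holds
    instruction c - s (if that instruction exists).
    """
    snapshot = {}
    for depth, stage in enumerate(STAGES):
        instr = cycle - depth
        if 0 <= instr < num_instructions:
            snapshot[stage] = instr
    return snapshot


def total_cycles(num_instructions):
    """Cycles needed for every instruction to pass the last stage."""
    return num_instructions + len(STAGES) - 1
```

With the pipeline full, each cycle completes one instruction even though every instruction spends four cycles in flight, which is the one-per-clock rate described above.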

Through the use of both pipelining and superscalar 
techniques, modern microprocessors may execute multi- 
cycle machine instructions at a rate greater than one per 
machine clock cycle, assuming that the instructions proceed 
in a known sequence. However, as is well known in the art 
of computer programming, many programs do not neces- 
sarily run in the sequential order of the instructions, but 
instead include branches (both conditional and 
unconditional) to program instructions that are not in the 
current sequence, subroutine calls, unconditional jumps, and 
other types of non-sequential operation. Such operations 
clearly provide a challenge to the pipelined microprocessor, 
in that the instructions in the microprocessor pipeline may 
not be the instructions that are actually executed. For 
example, a conditional branch instruction may, upon 
execution, cause a branch to an instruction other than the 
next sequential instruction currently in the pipeline, based 
upon the execution results. In this event, the results of those 
instructions currently in the pipeline will not be used, and 
the pipeline must then be "flushed", or emptied, so that the
actual next instruction (i.e., the destination of the branch)
can be fetched, decoded, and executed. This flushing spends
multiple machine clock cycles before the execution of the
next instruction can occur, and the intervening clock cycles
required to re-fill the pipeline appear as idle cycles from the
viewpoint of completed instructions.

The effect of this non-sequential operation, and of the
resultant flushing of the pipeline, is exacerbated in the case
of superscalar pipelined microprocessors. If, for example, a
branch or other interruption in the sequential instruction
flow of the microprocessor occurs in such microprocessors,
the number of lost pipeline slots, or lost execution
opportunities, is multiplied by the number of parallel
execution units (i.e., parallel pipelines). The performance
degradation due to branches and non-sequential program
execution is therefore amplified in superscalar pipelined
microprocessors.
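
As a rough illustration of the multiplication described above, the following sketch counts lost execution opportunities for a single flush. The function name and the example numbers are assumptions chosen for illustration only; the specification gives no specific figures.

```python
def lost_execution_slots(refill_cycles, num_pipelines):
    """Execution opportunities lost to one pipeline flush.

    Assumes every parallel pipeline sits idle for the same number of
    refill cycles, which is a simplification.
    """
    return refill_cycles * num_pipelines
```

Under this simplification, a flush that costs four cycles wastes four slots in a scalar machine but sixteen in a machine with four parallel pipelines.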

In order to minimize the degradation of microprocessor
performance that results from non-sequential program
execution, many modern microprocessors now incorporate
speculative execution based upon branch prediction. Branch
prediction predicts, on a statistical basis, the results of each
conditional branch (i.e., whether the branch will be "taken"
or "not-taken"), and continues fetching instructions and
operating the pipeline based on the predicted outcome of the
condition. Those instructions that are fetched based upon
such a prediction will proceed along the pipelines until the
actual result of the condition is determined. If the prediction
was correct, the speculative execution of the predicted
instructions maintains the microprocessor at its highest
performance level through full pipeline utilization. In the
event that the prediction was incorrect, the pipeline must be
"flushed" to remove all instructions that have not yet
completed. As is known in the art, the use of conventional
branch prediction and speculative execution techniques has
provided improved overall microprocessor performance.

By way of further background, conventional speculative
execution techniques have included the use of branch target
buffers (BTBs) and return address stacks. Conventional
BTBs are cache-like buffers that are used in the fetch units
of microprocessors to store an identifier of a previously
performed branch instruction as a tag, along with the target
address (i.e., the address to which the branch points in its
predicted state) and an indication of the branch's history.
Upon subsequent fetches of the branch, the target address is
used (depending on the branch history) as the next address
to fetch in the pipeline; upon execution of the branch
instruction itself, the target address is compared against the
actual next instruction address determined by the execution
unit to verify whether the speculative execution was valid.
Return address stacks, according to conventional
techniques, store the next sequential instruction address to
be executed after return from the subroutine (i.e., the next
instruction in the calling program after a subroutine call), in
similar fashion as the actual return address is stored in a
logical stack upon execution of the call. The instruction
address stored in the return address stack is used to
speculatively fetch the next instruction after the return. Upon
execution of the return, this value from the return address
stack is compared against the actual return address popped
from the logical stack to verify whether the speculative
pipeline operation was valid.
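
The conventional structures described above can be sketched as follows. This is a simplified software model under stated assumptions, not a description of any particular hardware: the class and method names are hypothetical, the branch history is reduced to a single taken/not-taken flag, and the stack depth and overflow policy are arbitrary choices.

```python
class BranchTargetBuffer:
    """Conventional BTB model: tag -> (target address, predicted taken)."""

    def __init__(self):
        self._entries = {}

    def update(self, branch_addr, target, taken):
        # Record the outcome of an executed branch (history = last outcome,
        # an assumption made to keep the sketch small).
        self._entries[branch_addr] = (target, taken)

    def predict_next_fetch(self, fetch_addr, fallthrough):
        # On a hit predicted taken, fetch from the stored target;
        # otherwise continue with the next sequential address.
        entry = self._entries.get(fetch_addr)
        if entry is None:
            return fallthrough
        target, predicted_taken = entry
        return target if predicted_taken else fallthrough


class ReturnAddressStack:
    """Conventional return address stack model (last-in-first-out)."""

    def __init__(self, depth=4):  # depth is an assumed value
        self._depth = depth
        self._stack = []

    def push(self, return_addr):
        if len(self._stack) == self._depth:
            self._stack.pop(0)  # assumed policy: drop the oldest entry
        self._stack.append(return_addr)

    def pop(self):
        return self._stack.pop() if self._stack else None


def speculation_valid(predicted_addr, actual_addr):
    """Compare the speculative address with the one found at execution."""
    return predicted_addr == actual_addr
```

In both structures the check at execution time is the same comparison: the address used for the speculative fetch against the address actually produced by the execution unit or popped from the logical stack.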

Despite the use of these techniques, pipeline stalls can still
occur in the event of branches and subroutine calls, due to
conflicts (or "interlocks") in the use of certain microprocessor
resources. For example, an instruction may require, at an
early stage in the pipeline, the contents of a certain register
location that will not be written until the completion of the
execution stage of an earlier-in-time instruction. The
interlock arises because the later instruction must wait until the
register is written upon execution of the earlier instruction.
While the pipeline does not need to be flushed in this event,
the instructions in the pipeline cannot advance until the
interlock is resolved (i.e., until the register or other resource
is released by the earlier instruction). These interlocks can
occur not only in the case of speculative execution, but also
in the case of unconditional branches, subroutine calls, and
the like. As should be readily apparent, such interlocks
degrade the overall performance of the microprocessor, as
idle machine clock cycles are required in such cases.

By way of further background, and as mentioned above,
the use of a portion of memory as a logical "stack" is well
known in the art. A conventional stack is implemented as a
group of multiple memory locations that are dealt with in a
last-in-first-out manner, where the contents of a register,
commonly referred to as the stack pointer, contain the
current address of the top of the stack. The stack will be
defined by the architecture of the microprocessor; for
example, the stack in x86-architecture microprocessors is
that portion of memory in the SS segment to which the SP
register points. Other architectures, such as the IBM 360
architecture, may not use a stack (i.e., a portion of memory)
but may instead use a register that is identified by an operand
in the return instruction, to store the return address in a
similar fashion as a stack. Those architectures having stacks
also generally respond to simple instructions such as PUSH
and POP, to store data to and load data from the stack,
respectively, modifying the stack pointer accordingly in
either case. The stack of a microprocessor is often used in
connection with subroutine calls, as it provides a convenient
conduit for the passing of parameters back and forth
between a calling program and a subroutine. In addition, as
noted above, subroutine calls also generally PUSH the
return address onto the stack, during their execution.

It has been discovered, in connection with the present
invention, that subroutine calls in superscalar x86
architecture microprocessors can give rise to interlocks due to
conflicts regarding the stack pointer. This is because
subroutine calls and returns, each of which can be
multiple-cycle instructions, perform stack operations (such as the
PUSH and POP of the return address) and thus modify the
stack pointer in their execution stage. Scalar microprocessors
can typically assume a value for the stack pointer in
speculatively executed instructions, based on the single
pipeline. However, in conventional superscalar microprocessor
designs, instructions that immediately follow the
execution of calls and returns, and that perform stack
operations (and thus modify the stack pointer), cannot be
executed until the completion of the call or return, as the
contents of the stack pointer may be modified by the
execution of a parallel instruction. Similar problems may
also arise in those architectures that use registers, rather than
a stack, for the storage of information relating to the target
addresses of calls and returns.

It is therefore an object of the present invention to provide
a superscalar microprocessor and method of operating the
same so as to avoid interlocks in call and return instructions.

It is a further object of the present invention to provide
such a microprocessor and method in which interlocks are
avoided by extending conventional stack and branch target
buffer entries to incorporate register values.

It is a further object of the present invention to provide
such a microprocessor and method in which speculative
execution is assisted.

Other objects and advantages of the present invention will
be apparent to those of ordinary skill in the art having
reference to the following specification together with its
drawings.

SUMMARY OF THE INVENTION

The invention may be implemented into a microprocessor
by providing extensions to existing return address stack or
branch target buffer entries used in connection with selected
branching instructions, such as subroutine call and return
instructions. The extension provides a location at which to
store the contents of a register, for example the stack pointer,
in association with the destination of the branching
instruction. The register contents stored in the extension are
matched, in combination with the stored destination, with the
contents of the same register as used in speculatively executed
instructions following the branching instruction, to
determine whether the speculative execution was valid. The
frequency of interlocks in the pipelined operation of the
microprocessor is thus reduced, as values for the register are
made available to [illegible].

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is an electrical diagram, in block form, of a
superscalar microprocessor within which the preferred
embodiment of the invention is implemented.

FIG. 2 is an electrical diagram, in block form, of the fetch
unit of the microprocessor of FIG. 1 according to the
preferred embodiment of the invention.

FIG. 3 is a schematic representation of the contents of an
entry in the branch target buffer (BTB) according to the
preferred embodiment of the invention.

FIGS. 4a, 4b, 4c, 4d, and 4e are representations of the
contents of the return address stack and branch target buffer,
with stack pointer extensions, at various stages of the
execution of a code fragment, according to the preferred
embodiment of the invention.

DETAILED DESCRIPTION OF THE
PREFERRED EMBODIMENT

Referring now to FIG. 1, an exemplary data processing
system 2, including an exemplary superscalar pipelined
microprocessor 10 within which the preferred embodiment
of the invention is implemented, will be described. It is to be
understood that the architecture of system 2 and of
microprocessor 10 is described herein by way of example only, as
it is contemplated that the present invention may be utilized
in microprocessors of various architectures, with particular
benefit to those of the superscalar type. It is therefore
contemplated that one of ordinary skill in the art, having
reference to this specification, will be readily able to
implement the present invention in such other microprocessor
architectures.

Microprocessor 10, as shown in FIG. 1, is connected to
other system devices by way of bus B. While bus B, in this
example, is shown as a single bus, it is of course
contemplated that bus B may represent multiple buses having
different speeds and protocols, as is known in conventional
computers utilizing the PCI local bus architecture; single bus
B is illustrated here merely by way of example and for its
simplicity. System 2 contains such conventional subsystems
as communication ports 3 (including modem ports and
modems, network interfaces, and the like), graphics display
system 4 (including video memory, video processors, a
graphics monitor), main memory system 5 which is typically
implemented by way of dynamic random access memory
(DRAM), input devices 6 (including keyboard, a pointing
device, and the interface circuitry therefor), and disk system
8 (which may include hard disk drives, floppy disk drives,
and CD-ROM drives). It is therefore contemplated that
system 2 of FIG. 1 corresponds to a conventional desktop
computer or workstation, as are now common in the art. Of
course, other system implementations of microprocessor 10
can also benefit from the present invention, as will be
recognized by those of ordinary skill in the art.

Microprocessor 10 includes bus interface unit 12 that is
connected to bus B, and which controls and effects
communication between microprocessor 10 and the other elements
in system 2. BIU 12 includes the appropriate control and
clock circuitry to perform this function, including write
buffers for increasing the speed of operation, and including
timing circuitry so as to synchronize the results of internal
microprocessor operation with bus B timing constraints.
Microprocessor 10 also includes clock generation and
control circuitry 20 which, in this exemplary microprocessor 10,
generates internal clock phases based upon the bus clock
from bus B; the frequency of the internal clock phases, in
this example, may be selectably programmed as a multiple
of the frequency of the bus clock.

As is evident in FIG. 1, microprocessor 10 has three levels
of internal cache memory, with the highest of these as level
2 cache 14, which is connected to BIU 12. In this example,
level 2 cache 14 is a unified cache, and is configured to
receive all cacheable data and cacheable instructions from
bus B via BIU 12, such that much of the bus traffic presented
by microprocessor 10 is accomplished via level 2 cache 14.
Of course, microprocessor 10 may also effect bus traffic
around cache 14, by treating certain bus reads and writes as
"not cacheable". Level 2 cache 14, as shown in FIG. 1, is
connected to two level 1 caches 16; level 1 data cache 16d
is dedicated to data, while level 1 instruction cache 16i is
dedicated to instructions. Power consumption by
microprocessor 10 is minimized by accessing level 2 cache 14
only in the event of cache misses of the appropriate one of
the level 1 caches 16. Furthermore, on the data side,
microcache 18 is provided as a level 0 cache, and in this example
is a fully dual-ported cache.

As shown in FIG. 1 and as noted hereinabove,
microprocessor 10 is of the superscalar type. In this example multiple
execution units are provided within microprocessor 10,
allowing up to four instructions to be simultaneously
executed in parallel for a single instruction pointer entry.
These execution units include two ALUs 42₀, 42₁ for
processing conditional branch, integer, and logical
operations, floating-point unit (FPU) 30, two load-store
units 40₀, 40₁, and microsequencer 48. The two load-store
units 40 utilize the two ports to microcache 18, for true
parallel access thereto, and also perform load and store
operations to registers in register file 39. Data
microtranslation lookaside buffer (µTLB) 38 is provided to translate
logical data addresses into physical addresses, in the
conventional manner.

These multiple execution units are controlled by way of
multiple seven-stage pipelines. These stages are as follows:

F    Fetch: This stage generates the instruction address and
     reads the instruction from the instruction cache or memory
PD0  Predecode stage 0: This stage determines the length and
     starting position of up to three fetched x86-type
     instructions
PD1  Predecode stage 1: This stage extracts the x86 instruction
     bytes and recodes them into fixed length format for decode
DC   Decode: This stage translates the x86 instructions into
     atomic operations (AOps)
SC   Schedule: This stage assigns up to four AOps to the
     appropriate execution units
OP   Operand: This stage retrieves the register operands
     indicated by the AOps
EX   Execute: This stage runs the execution units according to
     the AOps and the retrieved operands
WB   Write back: This stage stores the results of the execution
     in registers or in memory

Referring back to FIG. 1, the pipeline stages noted above
are performed by various functional blocks within
microprocessor 10. Fetch unit 26 generates instruction addresses
from the instruction pointer, by way of instruction
micro-translation lookaside buffer (µTLB) 22, which translates the
logical instruction address to a physical address in the
conventional way, for application to level 1 instruction cache
16i. Instruction cache 16i produces a stream of instruction
data to fetch unit 26, which in turn provides the instruction
code to the predecode stages in the desired sequence.
Speculative execution is primarily controlled by fetch unit
26, in a manner to be described in further detail hereinbelow.

Predecoding of the instructions is broken into two parts in
microprocessor 10, namely predecode 0 stage 28 and
predecode 1 stage 32. These two stages operate as separate
pipeline stages, and together operate to locate up to three
x86 instructions and apply the same to decoder 34. As such,
the predecode stage of the pipeline in microprocessor 10 is
three instructions wide. Predecode 0 unit 28, as noted above,
determines the size and position of as many as three x86
instructions (which, of course, are variable length), and as
such consists of three instruction recognizers; predecode 1
unit 32 recodes the multi-byte instructions into a
fixed-length format, to facilitate decoding.

Decode unit 34, in this example, contains four instruction
decoders, each capable of receiving a fixed length x86
instruction from predecode 1 unit 32 and producing from
one to three atomic operations (AOps); AOps are
substantially equivalent to RISC instructions. Three of the four
decoders operate in parallel, placing up to nine AOps into
the decode queue at the output of decode unit 34 to await
scheduling; the fourth decoder is reserved for special cases.
Scheduler 36 reads up to four AOps from the decode queue
at the output of decode unit 34, and assigns these AOps to
the appropriate execution units. In addition, the operand unit
44 receives and prepares the operands for execution. As
indicated in FIG. 1, operand unit 44 receives an input from
scheduler 36 and also from microcode ROM 46, via
multiplexer 45, and fetches register operands for use in the
execution of the instructions. In addition, according to this
example, operand unit 44 performs operand forwarding to send
results to registers that are ready to be stored, and also
performs address generation for AOps of the load and store
type.

Microsequencer 48, in combination with microcode ROM
46, controls ALUs 42 and load/store units 40 in the execution
of microcode entry AOps, which are generally the last AOps
to execute in a cycle. In this example, microsequencer 48
sequences through microinstructions stored in microcode
ROM 46 to effect this control for those microcoded
microinstructions. Examples of microcoded microinstructions
include, for microprocessor 10, complex or rarely-used x86
instructions, x86 instructions that modify segment or control
registers, handling of exceptions and interrupts, and
multi-cycle instructions (such as REP instructions, and instructions
that PUSH and POP all registers).

Microprocessor 10 also includes circuitry 24 for controlling
the operation of JTAG scan testing, and of certain
built-in self-test functions, ensuring the validity of the
operation of microprocessor 10 upon completion of
manufacturing, and upon resets and other events.

Referring now to FIG. 2, the construction and operation of
fetch unit 26 according to the preferred embodiment of the
invention will now be described. As noted above, fetch unit
26 performs the function of determining the address of the
next instruction to be fetched for decode. As such, fetch unit
26 determines the sequence in which instructions are loaded
into the pipelines of microprocessor 10, and in this
embodiment of the invention thus controls the speculative execution
of addresses, particularly by way of branch prediction.
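
The mechanism summarized in the abstract, in which return address stack and branch target buffer entries each pair a destination address with a stack pointer value, can be sketched in software as follows. This is only an illustrative model under stated assumptions: the class names, method names, and the addresses in the usage note are hypothetical, and the validity check simply requires that both the stored destination and the stored stack pointer match the values found at execution.

```python
from collections import namedtuple

# One entry: a destination address plus the associated register value.
Prediction = namedtuple("Prediction", ["address", "stack_pointer"])


class ExtendedReturnAddressStack:
    """Return address stack whose entries also hold a stack pointer."""

    def __init__(self):
        self._entries = []

    def on_call(self, return_addr, stack_pointer):
        # Store the return address together with the current stack pointer.
        self._entries.append(Prediction(return_addr, stack_pointer))

    def on_return_fetch(self):
        # Supply both values for speculative execution after the return.
        return self._entries.pop() if self._entries else None


class ExtendedBTB:
    """Branch target buffer whose entries also hold a stack pointer."""

    def __init__(self):
        self._entries = {}

    def record(self, branch_addr, target, stack_pointer):
        self._entries[branch_addr] = Prediction(target, stack_pointer)

    def lookup(self, fetch_addr):
        return self._entries.get(fetch_addr)


def speculation_valid(prediction, actual_addr, actual_sp):
    """Speculation holds only if address and stack pointer both match."""
    return (
        prediction is not None
        and prediction.address == actual_addr
        and prediction.stack_pointer == actual_sp
    )
```

For example, after `on_call(0x120, 0x1000)`, a later `on_return_fetch()` yields both 0x120 and 0x1000, so instructions following the return can begin with a stack pointer value in hand rather than waiting for the return to complete; if execution later produces a different stack pointer, `speculation_valid` reports the mismatch.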

The operation of fetc h unit 26 is based u pon judical 
/^ fejch^a d clress FA jhaUs. generated according to.-o ne^of 
^several'waysrFetc gaddres s FA mav-be-generated-merely-by^ 1 



fetch unit 26, for eventual presentation to predecode 0 stage 
28. In the case where each physical address PA addresses a 
block of sixteen instructions, instruction buffer 60 has a 
capacity of sixteen instructions. 

Fetch unit 26 also includes other conventional functions, 
such as instruction break check circuit 62 which halts 
additional fetching for instructions identified as breaks. 
Fetch unit 26 also includes a code segment limit check 
circuit 64, for determining whether fetch address FA is 
outside the limit of the bounds of the current code segment. 

C"I36^ icTI~fetcrragaress l v A"is co nnected to an jnput_of BTBl) 
^56,~^cl f]detelTOine! i^ 

~b1ranch~instr^ion~triat"has recently been"fetch^r and"wHic lh 
^mavZhave-branch^history stored in„BTB„5,6^for use in? 
]g pe^laTiye Z5xcntjon pAs noted hereinabove, speculative 
execution is an especially important performance enhance- 
ment in deeply pipelined microprocessors such as supersca- 
lar microprocessor 10 of FIG. 1, as mispredicted branches 



^ncremcntmg-oHetch-pointer^ (f)r j line slalls awailing lne resuhs of a conditional 

^case where _the next seq uential address is _to„be-fetched-for 
<decodin|^\s^hown-in"FIGr2rfetch"pointer 50-is-aregister 
(^iiTfetch"uj uV26,,haying,an,increment.control input.INC^and 
/which presents"itggutput"to one"input"of-multiplexer"52-A 



branch) result in severe penalties, measured in lost execution 
opportunities^CT : ^isTlnem ory arranged'in.axacHellikeTD 
cojtfg^ultiolirio^^am 



data entries DO through D127; each way of BTB 56 further 
includes 128 speculative stack pointer entries SP0 through 
SP127 for assisting speculative execujion,_a_s___will_be 
described in further detail hereinbelowTSdditionaibits-sucn^ 



FIG. 3 illustrates a single tag and associated entry S6 n in 
BTB 56, according to this preferred embodiment of the 
invention. Tag T„ shown in FIG, 3 includes a logical address 
portion LA that is the address of a recently performed 
"branching" instruction, i.e., an instruction that recently 
> effected a no n -sequential instruction fetch (such as a 



^jh ej~start ing.ofEsei-of^the„specific„instruction- wittnn~tfie 
sjxje^rrinsuiIctionxod£l ^ 



. _ ^ A . , , r ciative cache buffer. FIG. 2"illu strates BTB'56"in a simplistic 

csccon^ y-inwhich-the fetc ^ way in BTB 56, uHh£ 

o^e^th^xeomoj^ k exampieTha7l28'taisl ; 0 through T127 associated with 128 

event:ofia:branc h'that" is:notlprecuCtedIby:iet ch"untt"26" (as 
^wiU~bezde^ribe.d~heremb^ 

^b^ess o^the-next-m structiorrt o-be-fetched is" g enerated i n3> 
th^.execution-stage-ofaheTpi^Un^theIfetch:a ddress "FA'is 30 

presente*-^ f^ERTJfrtslir^ 

P l<^r52^h-umjj^oIu^^ irTWS^ar ed amon g the ways. \ 

^--ing -tne next-fetch address'FA in jwavsjhat.are.not-in program-b- * 

sejjujsm^ej-As^ho^^iOT 

cajddress^taclT55^tiichTisXl ast-in-fire 35 
<having-severaHlocjrtioj^^ 
<su1^utin e~caUs^and"su1^ b^^ 

sp eculativeexecution T ^^UIl^e TdescnT^rm-further-detail-^ 
f-hereinbel owr- In~a ddition,^as -wiILb^Ide^irj ^~ih~fuTth~er^^_ 

deTajFhel^inbelow^fetch^ 
^target buffeT(BTB^^ 

entries that store target addresses of branches and data
indicating the predicted condition of the branch, from which
fetch addresses FA may be generated to maintain the pipeline
in a filled condition based upon prediction of the branch.
Outputs from return address stack 55 and BTB 56 are
presented, by way of multiplexer 57, to the third input of
multiplexer 52, and the appropriate one of these outputs is
used to update fetch counter 50, under the control of
multiplexer 58. The three inputs to multiplexer 52 thus
present three ways in which fetch address FA is generated,
depending upon the state of operation of microprocessor 10.
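
By way of illustration only, the three-way selection of fetch address FA may be sketched as follows. This is a minimal sketch and not the circuit of FIG. 2: the names `FetchSources` and `next_fetch_address`, the fixed `FETCH_INCREMENT`, and the priority order among the three sources are all assumptions made for the example. The first two sources (an incremented fetch pointer and an address supplied from the execution stage) are read from the surrounding description and the labels of FIG. 2.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical fixed fetch increment; real x86 instructions vary in length.
FETCH_INCREMENT = 16


@dataclass
class FetchSources:
    fetch_pointer: int                       # current fetch pointer value
    execution_address: Optional[int] = None  # supplied by the execution stage
    predicted_address: Optional[int] = None  # from return address stack or BTB


def next_fetch_address(src: FetchSources) -> int:
    """Pick the next fetch address FA from the three candidate sources."""
    # Assumed priority: an address resolved by the execution stage
    # overrides any prediction, since it reflects actual program flow.
    if src.execution_address is not None:
        return src.execution_address
    # Otherwise follow a predicted target when one is available.
    if src.predicted_address is not None:
        return src.predicted_address
    # Default: continue fetching in program sequence.
    return src.fetch_pointer + FETCH_INCREMENT
```

In this sketch, a call such as `next_fetch_address(FetchSources(0x100, predicted_address=0x500))` follows the predicted target, while a call with no prediction and no execution-stage address simply advances the fetch pointer.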

Fetch address FA is presented, in fetch unit 26, to various
functions therein in order to control the fetching of the next
instruction for decoding. For example, fetch unit 26 is in
communication with instruction μTLB 22, which may
quickly return a matching physical address PA for logical
fetch address FA if an entry for fetch address FA is contained
therein. In any event, a physical address is presented by fetch
unit 26, either directly or from instruction μTLB 22 via
multiplexer 59, to instruction level 1 cache 16, for retrieval
of instruction code therefrom; of course, if a cache miss at
instruction level 1 cache 16 occurs, the physical address PA
is presented to unified level 2 cache 14 and, in the event of
a cache miss at that level, to main memory. In response to
physical address PA, instruction level 1 cache 16 presents an
instruction code sequence CODE to instruction buffer 60 in



Alternatively, physical addresses may be used as the tag
in BTB 56, if desired. Entry 56n has, associated with tag Tn,
a data entry Dn that corresponds to the target address of the
branching instruction identified by tag Tn.

Following the data entry Dn in entry 56n is a three-bit
history field HISn, which indicates the branch history (and
thus predicted state) of the branching instruction, and also
the type of branch, corresponding to entry 56n. For purposes
of this example, branching instructions that can initiate
speculative execution include conditional branch
instructions, other non-sequential instructions such as
subroutine calls and returns, and unconditional branches. As
such, the type of branching instruction is indicated in history
field HISn, as the prediction need only apply to conditional
branches. In this example, history field HISn is a three-bit
field, with the indication of branch type and prediction as
follows:

HIS    Branch type                    Prediction
111    conditional branch             Strongly Predicted Taken (ST)
110    conditional branch             Predicted Taken (T)
101    conditional branch             Predicted Not Taken (NT)
100    conditional branch             Strongly Predicted Not Taken (SNT)
011    CALL
010    RETurn
001    unconditional branch (JUMP)
000    invalid



The states ST, T, NT, SNT of a conditional branch are
indicative of the history of the conditional branch, and thus
of its prediction. A new conditional branch obtains either a
T or NT history upon
its first execution; this history is stored in BTB 56 with the
entry for that branch instruction. If the same result occurs in
a second successive occurrence of the branch, the "strongly"
states are entered; for example, two successive not-taken
results sets the history for a branch to SNT, and two
successive taken results conversely sets the history to ST. If
a history field HISn is set to a "strongly" state, the next
opposite result will move the history information to a
"not-strongly" state; for example, if an SNT branch is
"taken", its history is then changed to NT. Of course, since
CALLs, RETurns, and JUMPs are unconditional, no prediction
or history is appropriate.
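
The encoding and update behavior of the history field may be sketched, for purposes of example only, as follows. The constant and function names are hypothetical. The transitions out of the "strongly" states and into them follow the description above; the transitions between T and NT on an opposite result, and the T or NT state assigned on first execution, are assumptions modeled on a conventional two-bit saturating scheme.

```python
# Three-bit history field encodings, per the table above.
ST, T, NT, SNT = 0b111, 0b110, 0b101, 0b100
CALL, RET, JUMP, INVALID = 0b011, 0b010, 0b001, 0b000

# Next state for a conditional branch, keyed by the actual outcome.
_ON_TAKEN = {SNT: NT, NT: T, T: ST, ST: ST}
_ON_NOT_TAKEN = {ST: T, T: NT, NT: SNT, SNT: SNT}


def is_conditional(his: int) -> bool:
    """Conditional branches are the encodings with the top bit set."""
    return (his & 0b100) != 0


def predict_taken(his: int) -> bool:
    """CALL, RETurn and JUMP are unconditional and so always redirect fetch."""
    return his in (ST, T, CALL, RET, JUMP)


def initial_history(taken: bool) -> int:
    """Assumed history for a conditional branch after its first execution."""
    return T if taken else NT


def update_history(his: int, taken: bool) -> int:
    """Advance the history of a conditional branch; leave other types alone."""
    if not is_conditional(his):
        return his
    return _ON_TAKEN[his] if taken else _ON_NOT_TAKEN[his]
```

Under this sketch, two successive not-taken results yield SNT, and a subsequent taken result moves the field back to NT, consistent with the example given above.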

According to the preferred embodiment of the invention,
each entry 56n in BTB 56 also includes a field BSPn by way
of which a register value may be associated with the branch
or call instruction. Specifically, as will be described in
further detail hereinbelow, the value of the stack pointer SP
at the time of a subroutine call or return will be stored in the
field BSPn, at the time that the remainder of the entry 56n is
stored in BTB 56 for the call or return. The value of the stack
pointer is presented to BTB 56 from register file 39 on lines
SPIN. As will be described in further detail hereinbelow,
speculative execution of the sequential instructions following
the call or return may then be performed using the value
of the stack pointer that is stored in BTB 56 (referred to
hereinafter as the "branch-speculative stack pointer"), as
presented to predecode 0 stage 28 on lines SPOUT along with
the fetched speculative instruction; following the execution
of the call or return, the actual value of the stack pointer as
calculated by the execution unit is compared against the
branch-speculative stack pointer value used in the speculative
execution, to verify the validity of the speculative
pipeline.

As is conventional in microprocessors, the execution of a
subroutine call conventionally involves a push of the return
address, which is the next sequential address after the call
(i.e., the instruction to which program control is to be passed
upon return from the call) onto logical stack 7 in main
memory 5 (as shown in FIG. 1). Upon execution of the
return from the subroutine, the return address is popped from
the return address stack, and presented to fetch unit 26 by the
execution unit for use as fetch address FA.

According to this embodiment of the invention, fetch unit
26 includes return address stack 55, having multiple entries,
each of which includes a first portion RASTKn for storing
the next sequential instruction address as a speculative
return address for each subroutine call, and also includes a
second portion RSPn for storing an associated value of a
register, such as the stack pointer, for use in the speculative
execution. As noted hereinabove, the use of a return address
stack for storing the next sequential instruction address after
a subroutine call is known in the art. As a result of the fetch
of a subroutine call instruction, fetch unit 26 stores, in return
address stack 55, the address of the next sequential instruction
following the call in the calling program (e.g., the main
program), which is the instruction to which control will
likely pass upon completion of and return from the called
subroutine. This return address is stored in return address
stack 55 in LIFO fashion, with the multiple entries allowing
for nested subroutines. Similarly, upon execution of the
subroutine call, the appropriate execution unit (e.g.,
microsequencer 48) will push the return address onto logical
stack 7 in main memory 5. As is known in the art, fetch unit
26 will continue to maintain a full pipeline during the
operation of the subroutine by fetching the subroutine
instructions in sequence. Upon such time as the return from
the subroutine is fetched by fetch unit 26, speculative
execution of the return instruction and the sequential
instructions following the call (i.e., those instructions in the
calling program sequence following the call) is performed by
popping, from return address stack 55, the speculative return
address value for use in the speculative execution of the
return. Upon execution of the actual return, this speculative
return address is compared with the actual return address
popped from logical stack 7, to verify the validity of the
speculative execution.

Return address stack 55 may also store optional control
information for each entry. This control information, as is
known in the art, may include such information as validity
bits, type bits, and the like.

According to this preferred embodiment of the invention,
return address stack 55 also includes, for each entry, a
portion RSP for storing the value of a register, such as the
stack pointer, along with the speculative return address. As
in the case of BTB 56, the value of the speculative stack
pointer to be stored in return address stack 55 is provided by
the stack pointer in register file 39 during the fetch stage of
the subroutine call instruction, on lines SPIN. In the
speculative execution of the subroutine return and following
instructions, the speculative stack pointer value stored in
return address stack 55 is provided along with its associated
speculative return address. Upon execution of the actual
subroutine return, this speculative stack pointer value is
compared against the actual stack pointer value generated by
the execution unit, to verify validity of the speculative
execution.

As described above, microprocessor 10 according to this
embodiment of the invention incorporates stack pointer
extensions for both of the branch target buffer (BTB) 56 and
the return address stack 55. It is contemplated that providing
both of these extensions is preferred for the highest
performance of microprocessor 10. However, it is also
contemplated that either one or the other of these stack pointer
extensions may be used without the other, and benefits
provided thereby, within the spirit of the present invention.
In addition, while microprocessor 10 is described herein as
using these extensions for storage of the stack pointer, it is
also contemplated that the contents of other registers may be
similarly associated with the destinations of branching
instructions in this fashion. For example, in an x86
architecture microprocessor which uses segmented addressing for
the stack pointer, one may also or instead associate the stack
segment, which is the base address of the stack pointer, with
the target of the branching instruction. Further in the
alternative, the present invention may associate the code
segment with branching instruction destinations of "far"
calls in this manner, or still further in the alternative may
associate an entire set of machine states with the destination
of a task switch operation. Further in the alternative, the
stack pointer extensions may alternatively store a pointer to
the stack pointer register, to effect indirect access of the
stack pointer.

The operation of microprocessor 10, incorporating
extensions for both BTB 56 and return address stack 55 according
to this embodiment of the invention, will now be described
in combination with an exemplary code fragment, including
a subroutine call, which is repeated. Of course, this code
fragment is shown for purposes of example only, as the
present invention is useful in other types of branching
situations, as well.

An example of a code fragment with which the present
invention is utilized is as follows:

    100  PUSH AX
    105  PUSH CX
    110  CALL EXMPL
    120  POP CX
    125  POP AX

In this code fragment, parameters are pushed onto the stack
in instructions 100 and 105 to pass them to the subroutine,
in the conventional manner. The results of the execution of
the subroutine are then returned to the register file upon
return in instructions 120 and 125, also in the conventional
manner. The exemplary subroutine EXMPL, which includes
stack operations and thus modifications to the stack pointer,
is as follows:

    500  SUBROUTINE EXMPL
         ...
    560  POP AX
    565  POP CX
    570  PUSH AX
         ...
    590  PUSH CX
         ...
    700  RET

Referring now to FIGS. 4a through 4e, the operation of
fetch unit 26 in speculatively executing this code fragment,
according to the preferred embodiment of the invention, will
now be described in detail. FIG. 4a illustrates the contents
of BTB 56 and return address stack 55 in their initial state
prior to completion of the first execution of instruction 100.
As shown therein, upon the initial pass through the code
fragment, the contents of BTB 56 and return address stack
55 (for those locations relevant to this code fragment) are
empty.

As noted above, instructions 100 and 105 perform stack
operations to pass parameters to subroutine EXMPL, and as
such update the contents of the stack pointer. In addition, as
is well known, the execution of a subroutine CALL also
involves implicit stack operations, which will also update
the value of the stack pointer. Since this is the first pass
through this code fragment, instruction 110, which is the
CALL to subroutine EXMPL, is not recognized by BTB 56,
as there is no tag therein which matches that of instruction
110.

Upon the execution of instruction 110, however, both
BTB 56 and return address stack 55 are updated with entries
pertaining to this CALL, as shown in FIG. 4b. Upon
execution of the CALL instruction 110, return address stack
55 receives, from the execution unit, an entry 120 which is
the logical instruction address of the next sequential
instruction 120 in the calling program following the CALL;
according to this embodiment of the invention, the stack pointer
extension of return address stack 55 also receives, on lines
SPIN from register file 39, a speculative stack pointer value
SP110 which is the value of the stack pointer before the
execution of the CALL of instruction 110, and which will
thus be the value of the stack pointer upon return from the
subroutine EXMPL for use by instruction 120. BTB 56 also
receives an entry with a tag T110 corresponding to
instruction 110, a target value of 500 (the logical instruction address
of the subroutine EXMPL), and a history field of 011
(indicating that instruction 110 was a CALL); also, according
to this embodiment of the invention, the stack pointer
extension of BTB 56 receives, on lines SPIN from register
file 39, the value of the stack pointer that is to be used by
instruction 500 in the subroutine EXMPL (as
branch-speculative stack pointer value SP500) in a manner
associated with the tag for instruction 110. Since BTB 56 is a
cache-like configuration, the locations thereof at which these
values are stored are not necessarily in a physical order, but
will instead depend upon the value of T110.

In this first pass through this code fragment, an interlock
may develop relative to the stack pointer, depending upon
the number of instructions in subroutine EXMPL before
instruction 560 accesses the stack. However, the execution
of the CALL of instruction 110 may not be finished with the
stack and stack pointer at the time that the fetch and decode
of instruction 560 could otherwise begin. A pipeline stall
may thus be present in this first pass.

Pipelined execution of subroutine EXMPL thus
continues, until such time as the RETurn instruction 700 is
decoded by decode unit 34, at which time RETurn
instruction 700 is first recognized as a subroutine return in this pass
through the code fragment. At this point, the pipeline behind
RETurn instruction 700 is flushed. Fetch unit 26 then pops
the speculative return address 120 from return address stack
55, along with its associated speculative stack pointer value
SP110; as noted above, this value SP110 is expected to
correspond to the stack pointer value that instruction 120
will require, given the sequence of the calling program.
Fetch unit 26 then presents the code for instruction address
120 (e.g., from instruction level 1 cache 16) to predecode 0
stage 28 along with the associated speculative stack pointer
value SP110 for processing through the pipeline of
microprocessor 10. Execution of RETurn instruction 700, as is
well known, involves implicit operations on logical stack 7
(including at least the POP of the return address) and thus
modification of the stack pointer. In prior
superscalar microprocessors, this use of the stack pointer in
the execution of the RETurn instruction 700 would cause an
interlock that would delay the pipelining and speculative
execution of instruction 120. However, the use of the
speculative stack pointer value SP110 in the fetching of
instruction 120 et seq., according to this embodiment of the
invention, prevents any such interlock that would otherwise
arise from the possibly conflicting use of the stack and stack
pointer in the execution of the RETurn instruction 700 and
by the speculative fetching and decoding of instruction 120.
The contents of BTB 56 and return address stack 55 are
shown, at this point after the first pass fetch of RETurn
instruction 700 and before its execution, in FIG. 4c.

Upon execution of RETurn instruction 700, the execution
unit compares both the speculative return address 120 and
also the speculative stack pointer value SP110 used in the
speculative execution against the actual values of the return
address and stack pointer, respectively, generated by the
execution unit in effecting the return. If these values both
match, the speculative pipeline execution of instruction 120
(and subsequent instructions) by microprocessor 10 was
successful, and the pipeline remains filled. If either of these
values does not match, however, the speculative execution is
invalid, and the pipeline must be flushed.

In either case, upon execution of RETurn instruction 700,
BTB 56 receives another entry corresponding to this
instruction. As shown in FIG. 4d, this entry includes a tag T700
identifying the branching instruction RETurn 700, a target
value pointing to the instruction address of instruction 120
(i.e., the target of the RETurn), and a history value 010
indicating that the branching instruction is a subroutine
return; according to this embodiment of the invention, BTB
56 also stores, associated with this entry, a
branch-speculative stack pointer value SP120, which is the current
stack pointer value (that to be used by instruction 120), and
therefore is the stack pointer value that is likely to be used
on future passes through the subroutine EXMPL when called
from instruction 110, as in this example.

For purposes of this example, the operation of
microprocessor 10 according to this embodiment of the invention will
now be described as it executes a second or subsequent pass
of the code fragment shown hereinabove. This second pass
will, of course, initiate with the fetch of instruction 110 in
the pipeline. However, in this second pass, BTB 56 already
has an entry stored therein that is identified by tag T110, and
that points to instruction 500 as the target of the subroutine
CALL (indicated by HIS field 011). Fetch unit 26 will thus
use the target address 500 from BTB 56 to fetch the
instruction code for forwarding to predecode 0 stage 28.

In addition, according to this embodiment of the
invention, the extension of BTB 56 also has an entry SP500
as a branch-speculative stack pointer value that fetch unit 26
will send along with the instruction code for target
instruction 500 as it progresses through the pipeline. This "hit" by
BTB 56 in identifying the subroutine CALL of instruction
110 thus enables the speculative fetching and pipeline
advancement of this CALL to subroutine EXMPL, and
passes not only the target instruction address 500 but also
a stack pointer value SP500 that fetch unit 26 sends to
predecode 0 stage 28 on lines SPOUT. Stack pointer value
SP500 may be passed along the pipeline in several ways. For
example, microprocessor 10 may include a special path to
which lines SPOUT are connected so that stack pointer value
SP500 follows instruction 500 through the pipeline, for
example as an "immediate" operand. Alternatively, stack
pointer value SP500 may be stored in a temporary register in
register file 39, for subsequent retrieval in the operand stage
of the pipeline. Further in the alternative, stack pointer value
SP500 may bypass into a register file as a new "instance" of
the stack pointer SP, if microprocessor 10 incorporates register
renaming techniques for avoiding pipeline dependencies.

In any event, according to this embodiment of the
invention, the interlock that occurred due to stack pointer
conflicts on the first pass through this code fragment, as
described above, does not occur in subsequent passes
through the code, due to the storage of the
branch-speculative stack pointer value in BTB 56 and its
forwarding, with the instruction sequence, through the
pipeline.
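
The first-pass miss and second-pass hit described above may be illustrated by the following minimal sketch. It models BTB 56 as a simple dictionary keyed by branch address, which ignores the associativity and finite capacity of the actual buffer; the class and method names, and the numeric stack pointer value, are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional

HIS_CALL = 0b011  # history field encoding for a subroutine CALL


@dataclass
class BTBEntry:
    target: int  # data entry Dn: target instruction address
    his: int     # history field HISn
    bsp: int     # field BSPn: branch-speculative stack pointer


class BranchTargetBuffer:
    """Dictionary-backed stand-in for BTB 56 (no ways, no eviction)."""

    def __init__(self) -> None:
        self._entries: Dict[int, BTBEntry] = {}

    def lookup(self, fetch_address: int) -> Optional[BTBEntry]:
        """Return the entry whose tag matches the fetch address, if any."""
        return self._entries.get(fetch_address)

    def update(self, branch_address: int, target: int, his: int, sp: int) -> None:
        """Record target, type and stack pointer when the branch executes."""
        self._entries[branch_address] = BTBEntry(target, his, sp)


btb = BranchTargetBuffer()

# First pass: the CALL at 110 is not yet known, so no speculation is possible.
first_pass = btb.lookup(110)

# Executing the CALL stores target 500 and the stack pointer for instruction 500.
SP500 = 0x0FF8  # hypothetical stack pointer value
btb.update(110, target=500, his=HIS_CALL, sp=SP500)

# Second pass: the hit supplies both the target and the stack pointer value.
second_pass = btb.lookup(110)
```

On the second pass, `second_pass.target` and `second_pass.bsp` together correspond to what fetch unit 26 forwards to predecode 0 stage 28 in the description above.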

As before, the execution of the CALL instruction 110 will
store a speculative return address 120 and a speculative
stack pointer value SP110 in return address stack 55. The
contents of BTB 56 and return address stack 55 after the
execution of CALL instruction 110 on the second pass are
shown in FIG. 4e. Subroutine EXMPL is then executed, in
pipelined fashion, in the manner described hereinabove
for the first pass. At the point in the sequence at which the
RETurn instruction 700 is again fetched, fetch unit 26 pops
the speculative return address 120 from return address stack
55, and presents the instruction code fetched therewith to
predecode 0 stage 28 along with the speculative stack
pointer value SP110 that was stored in return address stack 55
in association with the speculative return address value.
Speculative execution of instruction 120 is then performed
as in the first pass, with the verification of the actual return
address and stack pointer value against the speculative
values therefor again performed upon execution of the
RETurn instruction 700.
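
The push, pop and verification steps for return address stack 55 may be sketched as follows, for purposes of example only. The class and function names are hypothetical. The default depth of four follows the RASTK0 through RASTK3 entries drawn in FIG. 2; discarding the oldest entry on overflow is an assumption, as is every numeric stack pointer value.

```python
from collections import deque
from typing import Deque, Optional, Tuple

# (speculative return address, speculative stack pointer)
Entry = Tuple[int, int]


class ReturnAddressStack:
    """LIFO of return addresses, each paired with a stack pointer value."""

    def __init__(self, depth: int = 4) -> None:
        # A bounded deque drops its oldest entry on overflow (an assumption).
        self._entries: Deque[Entry] = deque(maxlen=depth)

    def push(self, return_address: int, stack_pointer: int) -> None:
        """Called for a subroutine CALL."""
        self._entries.append((return_address, stack_pointer))

    def pop(self) -> Optional[Entry]:
        """Called when a RETurn is fetched; None if the stack is empty."""
        return self._entries.pop() if self._entries else None


def speculation_valid(spec: Entry, actual_address: int, actual_sp: int) -> bool:
    """Both the return address and the stack pointer must match."""
    return spec == (actual_address, actual_sp)


ras = ReturnAddressStack()
SP110 = 0x0FFC                       # hypothetical stack pointer value
ras.push(120, SP110)                 # CALL at 110: return address is 120
spec = ras.pop()                     # RETurn 700 fetched: speculate from here
ok = speculation_valid(spec, 120, SP110)
bad = speculation_valid(spec, 120, SP110 - 4)  # stack pointer mismatch
```

When `speculation_valid` returns false, as for `bad` above, the corresponding hardware action in the description is a flush of the speculative pipeline.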

Of course, in the fetching of RETurn instruction 700
through this second pass, BTB 56 will indicate that an entry
is present for this branching instruction, including a target
instruction address 120 and also a branch-speculative stack
pointer value SP120. Fetch unit 26 may therefore use these
values to fetch the instruction code for instruction address
120, for presentation to predecode 0 stage 28 along with
branch-speculative stack pointer value SP120. Speculative
execution of the instructions following the return from
subroutine EXMPL may then carry on, without interlocks
due to the conflict over the stack pointer, using these
speculative address and stack pointer values. In this
embodiment of the invention, however, where both BTB 56 and
return address stack 55 include extensions for storing stack
pointer values, it is preferred that fetch unit 26 use the
speculative stack pointer value from return address stack 55
rather than the branch-speculative stack pointer value from
BTB 56, as return address stack 55 was more recently
written than was BTB 56, and thus its contents are more
likely to match the actual return address and actual stack
pointer value on execution of the return than are the contents
of BTB 56.
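
The stated preference for return address stack 55 over BTB 56 on a subroutine return may be sketched as a simple selection, as follows. The function name is hypothetical, and falling back to the BTB value when the return address stack has no entry is an assumption. The second call site at address 210 is likewise hypothetical; it is included only to show a case in which the BTB entry for RETurn 700 is stale while the return address stack is current.

```python
from typing import Optional, Tuple

# (speculative target address, speculative stack pointer)
Prediction = Tuple[int, int]


def select_return_prediction(
    ras_entry: Optional[Prediction],
    btb_entry: Optional[Prediction],
) -> Optional[Prediction]:
    """Prefer the more recently written return address stack entry."""
    if ras_entry is not None:
        return ras_entry
    # Assumed fallback: use the BTB value when the stack has nothing to offer.
    return btb_entry


# BTB entry for RETurn 700, written when EXMPL last returned to instruction 120.
btb_entry = (120, 0x0FFC)   # hypothetical stack pointer value

# A hypothetical second CALL at 210 pushed return address 220 just before
# this RETurn was fetched, so the return address stack holds the newer data.
ras_entry = (220, 0x0FEC)   # hypothetical stack pointer value

chosen = select_return_prediction(ras_entry, btb_entry)
fallback = select_return_prediction(None, btb_entry)
```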

As noted above, however, an alternative microprocessor
construction may utilize only the stack pointer extension for
BTB 56, and may not incorporate either return address stack
55 or the stack pointer extension thereof. In this case, the
branch-speculative stack pointer value stored in BTB 56 will
be used in the speculative execution of instructions following
the return from subroutines, as described hereinabove. In
addition, the stack pointer extension for BTB 56 also allows
one to pass speculative stack pointer or other register values
with conditional branches and other branching instructions.

As is apparent from the foregoing description, the pre- 
ferred embodiment of the invention provides important 
advantages in the performance of a microprocessor and its 
data processing system, by enabling the storing of a register 
value in association with the destination instruction address 
of a branch operation. Particular advantages arise from the 
ability of a microprocessor constructed according to the 
preferred embodiment of the invention to speculatively 
execute instructions following a branch that involve stack 
operations and modifications to the stack pointer, as inter- 
locks are prevented that may otherwise occur over conflicts 
in the use of the stack pointer, especially in the case of 
superscalar microprocessor architectures. Due to the large 
number of subroutine calls and returns that are present in 
many conventional computer programs, it is contemplated 
that the present invention will provide an important perfor- 
mance enhancement in these systems. 

While the present invention has been described according 
to its preferred embodiments, it is of course contemplated 
that modifications of, and alternatives to, these 
embodiments, such modifications and alternatives obtaining 
the advantages and benefits of this invention, will be appar- 
ent to those of ordinary skill in the art having reference to 
this specification and its drawings. It is contemplated that 
such modifications and alternatives are within the scope of 
this invention as subsequently claimed herein. 

We claim: 

1. A pipelined microprocessor, comprising: 
a plurality of execution units for executing a plurality of 
instructions simultaneously; 
an instruction decode unit, for decoding instructions;

an instruction memory for storing instruction codes 
according to instruction addresses; 

a fetch unit, for retrieving instruction codes from the
instruction memory for a series of instructions, said
fetch unit operating to retrieve a second instruction
simultaneously with the execution of a first instruction
by one of the plurality of execution units, said fetch unit
comprising:

a branch prediction function for storing a speculative
target instruction address upon execution of said first
instruction corresponding to the address from which
to continue execution subsequent to execution of a
return type instruction, and for storing, in association
with the speculative target instruction address, a
speculative register value for use in speculatively
executing instructions following said return-type
instruction.

2. The microprocessor of claim 1, wherein the branch
prediction function comprises:

a branch target buffer, having a plurality of entries, each 
entry having a tag portion for storing an identity 
indicator for a branching-type instruction, having a 
target portion for storing the target instruction address 
in association with the tag portion, and having a specu- 
lative value portion for storing the speculative register 
value in association with the tag and target portions. 

3. The microprocessor of claim 2, wherein the speculative
value portion of each of the plurality of entries in the branch
target buffer is for storing a speculative stack pointer.

4. The microprocessor of claim 3, wherein the branch 
prediction function further comprises: 

a return address stack for storing a speculative return
address as the target instruction address and for storing,
in association with the speculative return address, a
speculative stack pointer value;

wherein the fetch unit stores the speculative return
address and associated speculative stack pointer value
in the return address stack responsive to executing an
instruction of the subroutine call type;

and wherein the fetch unit retrieves the speculative return
address and associated speculative stack pointer value
responsive to fetching an instruction of the subroutine
return type.

5. The microprocessor of claim 1, wherein the branch 
prediction function comprises: 

a return address stack for storing a speculative return
address as the target instruction address and for storing
a speculative stack pointer value in association with the
speculative return address;

wherein the fetch unit stores the speculative return
address and associated speculative stack pointer value
in the return address stack responsive to executing an
instruction of the subroutine call type;

and wherein the fetch unit retrieves the speculative return 
address and associated speculative stack pointer value 
responsive to fetching an instruction of the subroutine 
return type. 

6. The microprocessor of claim 1, wherein the instruction
memory is dedicated to storing instructions.

7. The microprocessor of claim 6, wherein the instruction 
memory comprises a first level instruction cache. 

8. The pipelined microprocessor according to claim 1,
wherein said return type instruction is a return instruction.

9. The pipelined microprocessor according to claim 1,
wherein said return type instruction is an instruction pair
wherein the first instruction pops a stack address into a
register and the second instruction branches to said stack
address in said register.

10. A method of operating a pipelined microprocessor to 
speculatively execute instructions, comprising the steps of: 

fetching a first instruction from an instruction memory,
responsive to an instruction address;

decoding the first instruction fetched from the instruction
memory;

executing the decoded first instruction in one of a plurality 
of execution units, which stores a speculative target 
instruction address and speculative register value in a 
branch prediction function; 

fetching a second instruction which according to said 
branch prediction function's prediction, corresponds to 
said speculative target instruction address and specu- 
lative register value; 

fetching and decoding a third and subsequent instructions 
using the speculative target instruction address; and 

executing said third and subsequent instructions and 
thereby supplying said speculative register value for 
use by the third and subsequent instructions until said 
second instruction executes and provides an actual 
target instruction address and an actual register value. 

11. The method of claim 10, further comprising:

executing said second instruction to generate said actual
target instruction address and said actual register value;

comparing the actual target instruction address to the
speculative target instruction address;

comparing the actual register value to the speculative
register value; and

responsive to the actual target instruction address
matching the speculative target instruction address and to the
actual register value matching the speculative register
value, continuing the execution of said subsequent
speculative instructions.

12. The method of claim 10, wherein the first instruction 
corresponds to a subroutine call; 

wherein the second instruction corresponds to a subrou- 
tine return; 

wherein the speculative target instruction corresponds to 
the next sequential instruction after the subroutine call 
in a computer program containing the subroutine call. 

13. The method of claim 12, wherein the register value 
corresponds to a stack pointer. 

14. The method of claim 12, wherein the step of storing 
a speculative target instruction address stores the speculative 
target instruction address in a return address stack. 

15. The method of claim 12, wherein the step of storing 
a speculative target instruction address stores the speculative 
target instruction address in a branch target buffer, associ- 
ated with a tag value corresponding to the second instruc- 
tion. 

16. The method of claim 15, further comprising:

repeating the step of fetching the second instruction;

responsive to the step of fetching the second instruction,
fetching the speculative target instruction address from
the branch target buffer and the speculative register
value associated therewith.

17. A microprocessor-based computer system, compris- 
ing: 

an input device; 
a display system; 
a main memory; and 
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a microprocessor, coupled to the input device, display 
system and main memory, and comprising: 
a plurality of execution units for executing a plurality 

of instructions simultaneously; 
an instruction decode unit, for decoding instructions; 5 
an instruction memory for storing instructions accord- 
ing to instruction addresses; 
a fetch unit, for retrieving instructions from the instruc- 
tion memory for a series of instructions, said fetch 
unit operating to retrieve a second instruction simul-
taneously with the execution of a first instruction by
one of the plurality of execution units, said fetch unit 
comprising: 

a branch prediction function for storing a speculative
target instruction address upon execution of said
first instruction corresponding to the address from
which to continue execution subsequent to execu-
tion of a return-type instruction, and for storing, in
association with the speculative target instruction
address, a speculative register value for use in
speculatively executing instruction following said
return-type instruction.

18. The system of claim 17, wherein the main memory 
includes a logical stack for storing a return address respon-
sive to the microprocessor performing a subroutine call
instruction; 

wherein the microprocessor further comprises a stack 
pointer register, for storing an address corresponding to 
a current memory location in the logical stack; 

wherein the speculative target instruction address stored
in the branch prediction function corresponds to the 
return address; 

and wherein the speculative register value stored in the 
branch prediction function corresponds to the value of
the stack pointer register. 

19. The system of claim 18, wherein a first execution unit 
initiates execution of instructions corresponding to the 
speculative target instruction address and those addresses of 
the next sequential instructions following said speculative
target instruction address, using the speculative register 
value associated therewith; 

and wherein, upon execution of a subroutine return
instruction, the first execution unit:

retrieves the return address from the logical stack;

compares the return address to the speculative target 
instruction address; 

compares the value of the stack pointer register to the 
speculative register value; and 

responsive to the return address matching the specula-
tive target instruction address and to the value of the
stack pointer register matching the speculative reg-
ister value, continues execution of the instructions
corresponding to the speculative target instruction
address and said instructions with addresses sequen-
tially following said speculative target instruction
address.
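
The two-part check recited in claim 19 can be pictured with a minimal sketch. This is only an illustration under assumed names (`Speculation`, `speculation_holds`, and the example addresses are hypothetical and do not appear in the patent); it shows that speculative execution continues only when both the return address and the stack pointer value match what was stored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Speculation:
    """Hypothetical record of what the fetch unit guessed."""
    target_address: int   # speculative target instruction address
    register_value: int   # speculative register value (e.g. stack pointer)


def speculation_holds(spec: Speculation,
                      actual_return_address: int,
                      actual_stack_pointer: int) -> bool:
    """Return True only if BOTH the address and the register value match."""
    return (spec.target_address == actual_return_address
            and spec.register_value == actual_stack_pointer)


# Example values are arbitrary and chosen only for illustration.
spec = Speculation(target_address=0x1004, register_value=0x7FF0)
print(speculation_holds(spec, 0x1004, 0x7FF0))  # address and stack pointer agree
print(speculation_holds(spec, 0x1004, 0x7FEC))  # stack pointer differs
```

In this sketch a mismatch in either value makes the check fail, which corresponds to the case where the speculatively executed instructions could not simply be continued.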

20. The system of claim 17, wherein the branch prediction 
function comprises: 
a branch target buffer, having a plurality of entries, each
entry having a tag portion for storing an identity 
indicator for a branching-type instruction, having a 
target portion for storing the target instruction address 
in association with the tag portion, and having a
speculative value portion for storing the speculative
register value in association with the tag and target 
portions. 
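
The three-portion entry of claim 20 can be sketched as a small data structure. The code below is a hedged illustration, not the claimed hardware: the class names, the direct-mapped indexing, and the use of the full branch address as the tag are assumptions made for brevity, and the default of 128 entries is likewise only an assumed size.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class BTBEntry:
    tag: int                  # identity indicator for the branching instruction
    target: int               # target instruction address
    spec_register_value: int  # speculative register value (e.g. stack pointer)


class BranchTargetBuffer:
    """Hypothetical direct-mapped buffer; the indexing scheme is an assumption."""

    def __init__(self, num_entries: int = 128) -> None:
        self.num_entries = num_entries
        self.entries: Dict[int, BTBEntry] = {}

    def update(self, branch_address: int, target: int, register_value: int) -> None:
        # Store target and register value together, keyed by the branch address.
        index = branch_address % self.num_entries
        self.entries[index] = BTBEntry(branch_address, target, register_value)

    def lookup(self, fetch_address: int) -> Optional[BTBEntry]:
        # A hit requires the stored tag to equal the fetched instruction's address.
        entry = self.entries.get(fetch_address % self.num_entries)
        if entry is not None and entry.tag == fetch_address:
            return entry
        return None


btb = BranchTargetBuffer()
btb.update(branch_address=0x2040, target=0x1004, register_value=0x7FF0)
hit = btb.lookup(0x2040)
print(hit.target == 0x1004, hit.spec_register_value == 0x7FF0)
```

On a later fetch of the same branching instruction, a hit returns the target address and the associated register value in one step, so speculative execution could begin with both.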

21. The system of claim 20, wherein the main memory 
includes a logical stack for storing a return address respon- 
sive to the microprocessor performing a subroutine call 
instruction; 

wherein the microprocessor further comprises a stack 
pointer register, for storing an address corresponding to 
a current memory location in the logical stack; 

wherein the branching type instruction corresponds to a 
subroutine return instruction, so that the target instruc- 
tion address stored in the branch prediction function 
corresponds to the return address; 

and wherein the speculative register value stored in the 
branch prediction function corresponds to the value of 
the stack pointer register. 

22. The method according to claim 20, wherein said 
branching-type instruction is a subroutine call type instruc- 
tion. 

23. The method according to claim 20, wherein said 
branching-type instruction is a subroutine return type 
instruction. 

24. The system of claim 17, wherein the branch prediction 
function comprises: 

a return address stack for storing a speculative return 
address as the target instruction address and for storing 
a speculative register value in association with the 
speculative return address; 

wherein the fetch unit stores the speculative return 
address and associated speculative register value in the 
return address stack responsive to executing an instruc- 
tion of the subroutine call type; 

and wherein the fetch unit retrieves the speculative return 
address and associated speculative register value 
responsive to fetching an instruction of the subroutine 
return type. 
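
The store-on-call and retrieve-on-return behavior of claim 24 can be sketched as a small last-in, first-out structure. This is an illustrative model under stated assumptions: the class and method names are hypothetical, and both the depth of four entries and the policy of discarding the oldest entry on overflow are assumptions, not requirements of the claim.

```python
from typing import List, Optional, Tuple


class ReturnAddressStack:
    """Hypothetical model pairing each return address with a register value."""

    def __init__(self, depth: int = 4) -> None:
        self.depth = depth
        self.entries: List[Tuple[int, int]] = []

    def push(self, return_address: int, stack_pointer: int) -> None:
        # On a subroutine call: store the return address with the stack pointer.
        if len(self.entries) == self.depth:
            self.entries.pop(0)  # assumed overflow policy: drop the oldest entry
        self.entries.append((return_address, stack_pointer))

    def pop(self) -> Optional[Tuple[int, int]]:
        # On fetching a subroutine return: retrieve the most recent pair, if any.
        return self.entries.pop() if self.entries else None


ras = ReturnAddressStack()
ras.push(0x1004, 0x7FF0)  # outer call
ras.push(0x3008, 0x7FE0)  # nested call
print(ras.pop() == (0x3008, 0x7FE0))  # nested return is predicted first
print(ras.pop() == (0x1004, 0x7FF0))  # then the outer return
```

Keeping the register value alongside the return address means that a single retrieval supplies everything needed to start speculative execution of the instructions after the call.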

25. The system of claim 24, wherein the main memory 
includes a logical stack for storing a return address respon- 
sive to the microprocessor performing the subroutine call 
instruction; 

wherein the microprocessor further comprises a stack 
pointer register, for storing an address corresponding to 
a current memory location in the logical stack; 

wherein the target instruction address stored in the return 
address stack corresponds to the return address; 

and wherein the speculative register value stored in the 
branch target buffer corresponds to the value of the 
stack pointer register. 

***** 